-
Notifications
You must be signed in to change notification settings - Fork 17
Add --all-ranks CLI infrastructure and core processing logic #115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
If LLMs were used please post the prompts, thanks! You can use #114 as a model. (If not used that is fine too, I just want to check!) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropped some nits on code dup, but I can see how the duplication might be intentional because all-rank processing might end up (or already is) diverging from one-rank processing, so you make the call:).
@@ -92,11 +100,12 @@ fn main() -> anyhow::Result<()> { | |||
strict: cli.strict, | |||
strict_compile_id: cli.strict_compile_id, | |||
custom_parsers: Vec::new(), | |||
custom_header_html: cli.custom_header_html, | |||
custom_header_html: cli.custom_header_html.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why clone now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm not mistaken, this is necessary because ParseConfig
needs to own the string and cli is passed by reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The emphasis of the question is on "why now" rather than just "why", the code compiled and seemed to work already before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what happened is during your changes, cli
was changed to a reference. It no longer is in this current PR version, so this .clone() shouldn't be necessary here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify, in my current cli.rs file, I would need the clone in handle_one_rank() because handle_one_rank takes in &Cli and can't move the String out of a borrowed struct. Is this right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The clone happens because you want to create ParseConfig using a Cli ref, but do you need 1 ParseConfig per rank?
Hint: Are there differences between the ParseConfig created for each rank?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm since ParseConfig is the same for all ranks, I could create ParseConfig once before processing ranks. And then pass that to handle_one_rank instead of passing cli. This would avoid repeated cloning of custom_header_html.
Is this the direction you're suggesting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that sounds reasonable
src/cli.rs
Outdated
|
||
// Extract rank number from filename | ||
let rank_num = if let Some(pos) = rank_name.find("rank_") { | ||
let after_rank = &rank_name[pos + 5..]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take a look at String.strip_prefix
src/cli.rs
Outdated
let after_rank = &rank_name[pos + 5..]; | ||
after_rank | ||
.chars() | ||
.take_while(|c| c.is_ascii_digit()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are the log file name patterns you need to match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently it supports rank_N.log where N is a numeric variable. The logic uses strip_prefix("rank_") then extracts consecutive ASCII digits as the rank number. Ex:
rank_0.log
,rank_1.log
,rank_10.log
rank_0_worker.log
something_rank_5.log
But rejects files without the "rank_" prefix or without digits after it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, does the compiler logs actually output log file names with that variance? You only need to handle the filenames logged via TORCH_TRACE right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I'll take out the unnecessary complexity.
src/cli.rs
Outdated
} | ||
|
||
// Add link to this rank's page | ||
rank_links.push((rank_num.clone(), format!("rank_{rank_num}/index.html"))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't want to hard code this here. Suppose someone changes the single rank codepath to change the directory where index.html is stored, then this line would break! Is there a way we could ensure that this line remains relevant even if the single rank logic changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed! Now it uses the actual output path returned by handle_one_rank()
instead of hardcoding "index.html".
I tested this by running both single rank and multi-rank processing with different output directories and verified that the generated links correctly point to the actual output files in their respective subdirectories, so that it will remain correct even if the single-rank output structure changes.
src/cli.rs
Outdated
} | ||
|
||
fn main() -> anyhow::Result<()> { | ||
let cli = Cli::parse(); | ||
|
||
if cli.all_ranks { | ||
return handle_all_ranks(&cli); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cli is no longer needed after handle_all_ranks, which means you can take by value, which in rust defaults to move semantics unless the Cli type implements the Copy trait (it doesn't).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend splitting this PR up a bit. There's pre-existing code that can be refactored e.g. PathBuf logic, handle_one_rank helper. Then there can be another PR to add handle_all_ranks
|
||
// Add link to this rank's page using the actual output path from handle_one_rank | ||
let rank_link = format!("rank_{}/{}", rank_num, main_output_path.display()); | ||
rank_links.push((rank_num.clone(), rank_link)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rank_links.push((rank_num.clone(), rank_link)); | |
rank_links.push((rank_num, rank_link)); |
rank_path: &PathBuf, | ||
rank_out_dir: &PathBuf, | ||
cli: &Cli, | ||
create_output_dir: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create_output_dir
seems avoidable
Try rewriting your callsites, so that this function either always creates the output directory, or the callsite always creates the output directory before calling into this.
Hint: can you rewrite all your newly added directory creating logic using setup_output_directory
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify, this would mean to remove the create_output_dir parameter and move directory creation to the callsites:
Single rank: uses existing setup_output_directory
Multi-rank: calls fs::create_dir(&rank_out_dir)? before handle_one_rank
Is this what you meant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that, or you could always call setup_output_directory from within handle_one_rank
let path = if cli.latest { | ||
let input_path = cli.path; | ||
let input_path = &cli.path; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't look like this change is needed
handle_one_rank(&path, &out_path, &cli, false)?; | ||
|
||
if !cli.no_browser { | ||
opener::open(out_path.join("index.html"))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's MAIN_OUTPUT_FILENAME for?
} | ||
|
||
// Extract rank number from the pattern | ||
let after_prefix = &filename[31..]; // Remove "dedicated_log_torch_trace_rank_" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strip_prefix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
look at line 225, looks very very similar. could we avoid computing things twice?
Maybe it's easier to express if you didn't use .filter here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid computing things twice, I'm intending to use filter_map to extract rank numbers once during collection, then returning Vec<(DirEntry, String)>. Is this approach feasible to address your feedback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah that can work. I was just thinking of falling back this logic to a foreach loop
}; | ||
|
||
// Only support PyTorch TORCH_TRACE files: dedicated_log_torch_trace_rank_0_hash.log | ||
if !filename.starts_with("dedicated_log_torch_trace_rank_") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary:
Future work: